8 research outputs found
Automatic grammar induction from free text using insights from cognitive grammar
Automatic identification of the grammatical structure of a sentence is useful in many Natural Language
Processing (NLP) applications such as Document Summarisation, Question Answering systems and
Machine Translation. With the availability of syntactic treebanks, supervised parsers have been
developed successfully for many major languages. However, developing such parsers for minority
languages with few digital resources is considerably more challenging. Moreover, the existing syntactic
annotation schemes are motivated by different linguistic theories and formalisms, are sometimes
language-specific, and cannot always be adapted for developing syntactic parsers across different
language families.
This project aims to develop a linguistically motivated approach to the automatic induction of
grammatical structures from raw sentences. Such an approach can be readily adapted to different
languages including low-resourced minority languages. We draw the basic approach to linguistic analysis
from usage-based, functional theories of grammar, such as Cognitive Grammar and Computational
Paninian Grammar, and from insights from psycholinguistic studies. Our approach identifies the
grammatical structure of a sentence by recognising domain-independent, general cognitive patterns of
conceptual organisation that occur in natural language. It also reflects some general psycholinguistic
properties of human parsing, such as incrementality, connectedness and expectation.
Our implementation has three components: Schema Definition, Schema Assembly and Schema
Prediction. The Schema Definition and Schema Assembly components were implemented algorithmically,
as a dictionary and a set of rules, while an Artificial Neural Network was trained for Schema
Prediction. Using part-of-speech (POS) tags to bootstrap the simplest case of token-level schema
definitions, a sentence is passed through all three components incrementally until all the words are
exhausted and the entire sentence is analysed as an instance of one final construction schema. The
order in which the intermediate schemas are assembled to form the final schema can be viewed as the
parse of the sentence. Parsers for English and Welsh (a low-resource minority language) were developed
using the same approach, with some changes to the Schema Definition component. We evaluated parser
performance in four ways: (a) quantitative evaluation, comparing the parsed chunks against the
constituents in a phrase-structure tree; (b) manual evaluation, listing the range of linguistic
constructions covered by the parser and performing error analysis on the parser outputs; (c) counting
the number of edits required for a correct assembly; and (d) qualitative evaluation based on Likert
scales in online surveys.
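As an illustration only, the incremental Schema Definition and Schema Assembly loop described above might be sketched as follows. The schema names, POS-to-schema dictionary and assembly rules here are invented toy stand-ins, and the real system additionally uses a trained neural network for Schema Prediction:

```python
# Toy sketch of the incremental Schema Definition -> Schema Assembly loop.
# All schema names and rules below are invented stand-ins.

# Schema Definition: bootstrap token-level schemas from POS tags.
SCHEMA_DEFS = {
    "DET": "thing-specifier",
    "NOUN": "thing",
    "VERB": "process",
    "ADJ": "property",
}

# Schema Assembly: rules combining two adjacent schemas into one.
ASSEMBLY_RULES = {
    ("thing-specifier", "thing"): "thing",
    ("property", "thing"): "thing",
    ("thing", "process"): "clause",
    ("clause", "thing"): "clause",
}

def parse(pos_tags):
    """Consume POS tags incrementally, assembling schemas left to right.

    Returns the assembly trace (the 'parse') and the final schema stack.
    """
    stack, trace = [], []
    for tag in pos_tags:
        stack.append(SCHEMA_DEFS[tag])  # token-level schema
        # Greedily assemble the two topmost schemas while a rule applies.
        while len(stack) >= 2 and tuple(stack[-2:]) in ASSEMBLY_RULES:
            left, right = stack[-2], stack[-1]
            combined = ASSEMBLY_RULES[(left, right)]
            trace.append((left, right, combined))
            stack[-2:] = [combined]
    return trace, stack

# "The cat chased a mouse" -> DET NOUN VERB DET NOUN
trace, final = parse(["DET", "NOUN", "VERB", "DET", "NOUN"])
```

Here the whole sentence ends up analysed as a single final schema, and `trace` records the order in which intermediate schemas were assembled, i.e. the parse.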
Corpus Creation for Sentiment Analysis in Code-Mixed Tamil-English Text
Understanding the sentiment of a comment from a video or an image is an
essential task in many applications. Sentiment analysis of a text can be useful
for various decision-making processes. One such application is to analyse the
popular sentiments of videos on social media based on viewer comments. However,
comments from social media do not follow strict rules of grammar, and they
contain mixing of more than one language, often written in non-native scripts.
The lack of annotated code-mixed data for a low-resourced language like
Tamil adds further difficulty. To overcome this, we created a gold
standard Tamil-English code-switched, sentiment-annotated corpus containing
15,744 comment posts from YouTube. In this paper, we describe the process of
creating the corpus and assigning polarities. We present inter-annotator
agreement and show the results of sentiment analysis trained on this corpus as
a benchmark.
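The benchmark results mentioned above come from sentiment models trained on the corpus. As a self-contained illustration only, not the paper's actual models, a minimal bag-of-words Naive Bayes baseline could look like this; the toy code-mixed comments and labels are invented for the example:

```python
import math
from collections import Counter, defaultdict

def train_nb(examples):
    """Train a bag-of-words multinomial Naive Bayes sentiment model.

    `examples` is a list of (token_list, label) pairs.
    """
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in examples:
        class_counts[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    total = sum(class_counts.values())
    priors = {c: math.log(k / total) for c, k in class_counts.items()}
    return priors, word_counts, vocab

def predict(model, tokens):
    """Return the most probable label under the trained model."""
    priors, word_counts, vocab = model
    best, best_score = None, -math.inf
    for c in priors:
        denom = sum(word_counts[c].values()) + len(vocab)  # Laplace smoothing
        score = priors[c] + sum(
            math.log((word_counts[c][w] + 1) / denom) for w in tokens
        )
        if score > best_score:
            best, best_score = c, score
    return best

# Invented toy code-mixed comments, for illustration only.
data = [
    ("semma mass movie".split(), "positive"),
    ("super movie vera level".split(), "positive"),
    ("waste movie worst".split(), "negative"),
    ("mokka padam waste".split(), "negative"),
]
model = train_nb(data)
```

Real benchmarks on noisy code-mixed text would of course need a proper tokeniser and a much larger training set.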
DravidianCodeMix: Sentiment Analysis and Offensive Language Identification Dataset for Dravidian Languages in Code-Mixed Text
This paper describes the development of a multilingual, manually annotated
dataset for three under-resourced Dravidian languages generated from social
media comments. The dataset was annotated for sentiment analysis and offensive
language identification for a total of more than 60,000 YouTube comments. The
dataset consists of around 44,000 comments in Tamil-English, around 7,000
comments in Kannada-English, and around 20,000 comments in Malayalam-English.
The data was manually annotated by volunteer annotators and shows high
inter-annotator agreement, as measured by Krippendorff's alpha. The dataset contains all
types of code-mixing phenomena since it comprises user-generated content from a
multilingual country. We also present baseline experiments to establish
benchmarks on the dataset using machine learning methods. The dataset is
available on GitHub
(https://github.com/bharathichezhiyan/DravidianCodeMix-Dataset) and Zenodo
(https://zenodo.org/record/4750858).
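Krippendorff's alpha, the agreement measure reported for this dataset, can be sketched for nominal labels with complete annotations as follows. This is a simplified illustration; vetted implementations (e.g. NLTK's `nltk.metrics.agreement`) are preferable in practice:

```python
from collections import Counter
from itertools import permutations

def krippendorff_alpha_nominal(units):
    """Krippendorff's alpha for nominal labels with no missing values.

    `units` is a list of tuples, one tuple of labels per annotated item
    (one label per annotator).
    """
    o = Counter()  # coincidence counts o[(c, k)]
    for labels in units:
        m = len(labels)
        for c, k in permutations(labels, 2):  # ordered annotator pairs
            o[(c, k)] += 1 / (m - 1)
    n_c = Counter()
    for (c, _), v in o.items():
        n_c[c] += v
    n = sum(n_c.values())  # total pairable values
    disagreement = sum(v for (c, k), v in o.items() if c != k)
    expected = sum(n_c[c] * n_c[k] for c, k in permutations(n_c, 2))
    if expected == 0:  # only one label ever used: no possible disagreement
        return 1.0
    return 1 - (n - 1) * disagreement / expected

# Two annotators labelling four comments:
alpha = krippendorff_alpha_nominal(
    [("pos", "pos"), ("pos", "neg"), ("neg", "neg"), ("neg", "neg")]
)
```

Alpha corrects observed disagreement by the disagreement expected from the pooled label distribution, so it remains comparable across label sets and numbers of annotators.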
Corpus creation for sentiment analysis in code-mixed Tamil-English text
Understanding the sentiment of a comment from a video or an image is an essential task in many applications. Sentiment analysis
of a text can be useful for various decision-making processes. One such application is to analyse the popular sentiments of videos
on social media based on viewer comments. However, comments from social media do not follow strict rules of grammar, and they
contain mixing of more than one language, often written in non-native scripts. Non-availability of annotated code-mixed data for a
low-resourced language like Tamil also adds difficulty to this problem. To overcome this, we created a gold standard Tamil-English
code-switched, sentiment-annotated corpus containing 15,744 comment posts from YouTube. In this paper, we describe the process of
creating the corpus and assigning polarities. We present inter-annotator agreement and show the results of sentiment analysis trained on
this corpus as a benchmark.
This publication has emanated from research supported in part by a research grant from Science
Foundation Ireland (SFI) under Grant Number SFI/12/RC/2289 (Insight) and SFI/12/RC/2289 P2
(Insight 2), co-funded by the European Regional Development Fund, as well as by the EU H2020
programme under grant agreements 731015 (ELEXIS, European Lexical Infrastructure) and 825182
(Prêt-à-LLOD), and by Irish Research Council grant IRCLA/2017/129 (CARDAMOM, Comparative Deep
Models of Language for Minority and Historical Languages). Non-peer-reviewed.